Load the necessary libraries
library(car) # for regression diagnostics
library(broom) # for tidy output
library(ggfortify) # for model diagnostics
library(sjPlot) # for outputs
library(knitr) # for kable
library(effects) # for partial effects plots
library(emmeans) # for estimating marginal means
library(ggeffects) # for partial effects plots
library(modelr) # for auxiliary modelling functions
library(DHARMa) # for residual diagnostics plots
library(performance) # for residuals diagnostics
library(see) # for plotting residuals
library(tidyverse) # for data wrangling
Polis et al. (1998) were interested in modelling the presence/absence of lizards (Uta sp.) against the perimeter to area ratio of 19 islands in the Gulf of California.
Uta lizard
Format of polis.csv data file
| ISLAND | RATIO | PA |
|---|---|---|
| Bota | 15.41 | 1 |
| Cabeza | 5.63 | 1 |
| Cerraja | 25.92 | 1 |
| Coronadito | 15.17 | 0 |
| .. | .. | .. |
| ISLAND | Categorical listing of the name of the 19 islands used - variable not used in analysis. |
| RATIO | Ratio of perimeter to area of the island. |
| PA | Presence (1) or absence (0) of Uta lizards on island. |
The aim of the analysis is to investigate the relationship between island perimeter to area ratio and the presence/absence of Uta lizards.
polis <- read_csv("../public/data/polis.csv", trim_ws = TRUE)
glimpse(polis)
## Rows: 19
## Columns: 3
## $ ISLAND <chr> "Bota", "Cabeza", "Cerraja", "Coronadito", "Flecha", "Gemelose"…
## $ RATIO <dbl> 15.41, 5.63, 25.92, 15.17, 13.04, 18.85, 30.95, 22.87, 12.01, 1…
## $ PA <dbl> 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1
head(polis)
str(polis)
## spec_tbl_df [19 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ISLAND: chr [1:19] "Bota" "Cabeza" "Cerraja" "Coronadito" ...
## $ RATIO : num [1:19] 15.41 5.63 25.92 15.17 13.04 ...
## $ PA : num [1:19] 1 1 1 0 1 0 0 0 0 1 ...
## - attr(*, "spec")=
## .. cols(
## .. ISLAND = col_character(),
## .. RATIO = col_double(),
## .. PA = col_double()
## .. )
Model formula: \[ y_i \sim{} \mathcal{Bin}(n, p_i)\\ ln\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_i \]
where \(y_i\) represents the \(i^{th}\) observed value, \(n\) represents the number of trials (for logistic regression, this is always 1), \(p_i\) represents the probability of lizards being present in the \(i^{th}\) population, and \(\beta_0\) and \(\beta_1\) represent the intercept and slope respectively.
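The logit link in the formula above maps probabilities onto a log odds scale, and its inverse maps back; in base R these are `qlogis()` and `plogis()`. A minimal sketch (using an arbitrary probability) confirming the mapping:

```r
# logit link: probability -> log odds
p <- 0.73
eta <- qlogis(p)
stopifnot(all.equal(eta, log(p / (1 - p))))

# inverse link: log odds -> probability
stopifnot(all.equal(plogis(eta), p))
```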
ggplot(polis, aes(y = PA, x = RATIO)) +
geom_point()
ggplot(polis, aes(y = PA, x = RATIO)) +
geom_point() +
geom_smooth(
method = "glm", formula = y ~ x,
method.args = list(family = "binomial")
)
polis.glm <- glm(PA ~ RATIO, family = binomial(link = "logit"), data = polis)
polis$Total <- 1
polis.glm1 <- glm(PA ~ RATIO, family = binomial(link = "logit"), data = polis, weights = Total)
polis$Total <- 1
polis.glm1 <- glm(cbind(PA, Total - PA) ~ RATIO, family = binomial(link = "logit"), data = polis)
Model validation routines draw heavily upon patterns in residuals. For OLS models, the residuals are fairly straightforward. They are the differences between the observed and predicted values, and they are the quantity that is minimised during optimisation (least squares refers to minimising the sum of squared residuals).
This is not the case for models based on maximum likelihood. Rather than minimising the sum of squared residuals, GLMs maximise likelihood. Hence residuals are not directly calculated as part of the optimisation process. Nevertheless, residual patterns are useful diagnostics.
For GLMs there are numerous different residual formulations:
| Residual type | Description | Function |
|---|---|---|
| Working | Residuals on the link scale, as used in the final iteration of the iteratively weighted least squares fit: `(y - mu)/mu.eta(eta)`. | `residuals(mod, "working")` |
| Deviance | Signed square-root of the deviance contribution of each observation. The sum of the squared deviance residuals is the total deviance. | `residuals(mod, "deviance")` |
| Pearson | The difference between observed and expected, scaled by the standard deviation. | `residuals(mod, "pearson")` |
| Partial | Working residuals added to the fitted values; one vector per predictor. | `residuals(mod, "partial")` |
| Response | Raw residuals: the difference between observed and expected. These are only appropriate for Gaussian models. | `residuals(mod, "response")` |
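The working residual formula can be verified directly against `residuals(mod, "working")`. The sketch below uses simulated data (not the polis data) so it is self-contained:

```r
set.seed(1)
x <- runif(30)
y <- rbinom(30, size = 1, prob = plogis(-1 + 3 * x))
mod <- glm(y ~ x, family = binomial)

eta <- predict(mod, type = "link")  # linear predictor
mu <- fitted(mod)                   # fitted probabilities
w <- (y - mu) / binomial()$mu.eta(eta)

stopifnot(all.equal(unname(w), unname(residuals(mod, "working"))))
```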
polis.glm %>% autoplot(which = 1:6, label.repel = TRUE)
## there does seem to be an outlier that is influential - obs #3
## Perhaps we should redo the scatterplot but with text so that we can see which obs is #3
Conclusions:
polis %>%
mutate(n = 1:nrow(.)) %>%
ggplot(aes(y = PA, x = RATIO)) +
geom_text(aes(label = n))
## It seems that Uta lizards were present on this island despite it having
## a relatively large perimeter to area ratio.
## Perhaps this island:
## - was a long way from other islands (isolated)
## - was not inhabited
## - was very large
## - had some other reason why it was not typical of the population.
## If none of these apply, then we cannot simply exclude the point.
## In any case, if anything, this obs is going to result in greater variability and
## a more conservative test - so it is not really harmful to leave it in.
polis.glm %>% influence.measures()
## Influence measures of
## glm(formula = PA ~ RATIO, family = binomial(link = "logit"), data = polis) :
##
## dfb.1_ dfb.RATI dffit cov.r cook.d hat inf
## 1 0.182077 -0.007083 0.447814 1.043 5.50e-02 0.109124
## 2 0.167005 -0.141263 0.169959 1.235 6.62e-03 0.111730
## 3 -0.723849 1.079157 1.278634 0.537 8.43e-01 0.151047 *
## 4 -0.239967 0.028419 -0.546081 0.953 9.01e-02 0.108681
## 5 0.248270 -0.126175 0.359999 1.117 3.30e-02 0.110025
## 6 0.028088 -0.196986 -0.437403 1.110 5.00e-02 0.129177
## 7 0.077131 -0.102575 -0.111591 1.250 2.81e-03 0.108288
## 8 0.140334 -0.247315 -0.332565 1.242 2.65e-02 0.155414
## 9 -0.562402 0.338850 -0.723598 0.805 1.89e-01 0.112842
## 10 0.257651 -0.162838 0.319655 1.157 2.52e-02 0.114067
## 11 0.176591 -0.147771 0.180516 1.234 7.49e-03 0.113765
## 12 0.104228 -0.093408 0.104419 1.225 2.46e-03 0.090774
## 13 0.135395 -0.118138 0.136380 1.233 4.23e-03 0.102909
## 14 0.000410 -0.000476 -0.000481 1.131 5.14e-08 0.001445
## 15 0.000218 -0.000251 -0.000254 1.130 1.43e-08 0.000817
## 16 0.139447 -0.248090 -0.335881 1.239 2.70e-02 0.155114
## 17 0.143708 -0.240774 -0.311977 1.255 2.31e-02 0.156543
## 18 0.074831 -0.068694 0.074832 1.211 1.26e-03 0.075520
## 19 0.108633 -0.097001 0.108890 1.226 2.68e-03 0.092718
polis.glm %>% performance::check_model()
polis.glm %>% performance::check_outliers()
## Warning: 1 outliers detected (cases 3).
polis.glm %>%
performance::check_outliers() %>%
plot()
## These are probabilities of exceedance rather than actual Cook's D values
# https://easystats.github.io/performance/reference/check_outliers.html
polis.glm %>% performance::check_heteroscedasticity()
## OK: Error variance appears to be homoscedastic (p = 0.094).
Conclusions
polis.resid <- polis.glm %>% simulateResiduals(plot = TRUE)
Conclusions:
For simple models, we can explore a lack of fit via a goodness of fit test. Here we will do this via both the Pearson \(\chi^2\) statistic and the deviance.
As the model was fit using maximum likelihood, it does not optimise the sum of squared residuals (as OLS does); therefore, it is arguably not appropriate to judge the performance of the model on a property that it is not directly controlling. Deviance, on the other hand, is calculated directly from the likelihood and is thus a fairer basis for gauging the goodness of the model's fit.
## Check the model for lack of fit via:
## Pearson chisq
polis.ss <- sum(resid(polis.glm, type = "pearson")^2)
1 - pchisq(polis.ss, polis.glm$df.residual)
## [1] 0.5715331
# No evidence of a lack of fit
# Deviance
1 - pchisq(polis.glm$deviance, polis.glm$df.residual)
## [1] 0.6514215
# No evidence of a lack of fit
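As a cross-check of the deviance used in the goodness of fit test above: the residual deviance is the sum of the squared deviance residuals. A small self-contained sketch on simulated data:

```r
set.seed(1)
x <- runif(40)
y <- rbinom(40, size = 1, prob = plogis(-1 + 2 * x))
mod <- glm(y ~ x, family = binomial)

# residual deviance equals the sum of squared deviance residuals
stopifnot(all.equal(sum(residuals(mod, "deviance")^2), deviance(mod)))
```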
Conclusions:
polis.glm %>%
augment() %>%
ggplot() +
geom_point(aes(y = .resid, x = .fitted))
There are also numerous routines and packages to support these fitted trends (partial plots).
| Package | Function | Type | Notes |
|---|---|---|---|
sjPlot |
plot_model(type = 'eff') |
Marginal means | |
effects |
allEffects() |
Marginal means | |
ggeffects |
ggeffects() |
Marginal means | calls effects() |
ggeffects |
ggpredict() |
Conditional means | calls predict() |
ggeffects |
ggemmeans() |
Marginal means | calls emmeans() |
polis.glm %>% plot_model(type = "eff", show.data = TRUE)
## $RATIO
polis.glm %>%
allEffects(residuals = TRUE) %>%
plot()
polis.glm %>%
allEffects(residuals = TRUE) %>%
plot(type = "response")
polis.glm %>%
ggpredict() %>%
plot(add.data = TRUE, jitter = FALSE)
## $RATIO
polis.glm %>%
ggemmeans(~RATIO) %>%
plot(add.data = TRUE, jitter = FALSE)
polis.glm %>% summary()
##
## Call:
## glm(formula = PA ~ RATIO, family = binomial(link = "logit"),
## data = polis)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6067 -0.6382 0.2368 0.4332 2.0986
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.6061 1.6953 2.127 0.0334 *
## RATIO -0.2196 0.1005 -2.184 0.0289 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26.287 on 18 degrees of freedom
## Residual deviance: 14.221 on 17 degrees of freedom
## AIC: 18.221
##
## Number of Fisher Scoring iterations: 6
Conclusions:
The coefficients and confidence intervals are estimated on the link (log odds) scale. To express them on the odds (ratio) scale, back-transform with exp(); log-laws dictate that on this scale they are applied multiplicatively.
## on link scale (log odds)
polis.glm %>% confint()
## 2.5 % 97.5 %
## (Intercept) 1.0059470 8.04214480
## RATIO -0.4849431 -0.06648957
## or on odds (ratio) scale
polis.glm %>%
confint() %>%
exp()
## 2.5 % 97.5 %
## (Intercept) 2.7344957 3109.2748308
## RATIO 0.6157322 0.9356727
polis.glm %>% tidy(conf.int = TRUE)
polis.glm %>% tidy(conf.int = TRUE, exponentiate = TRUE)
polis.glm %>% glance()
polis.glm %>%
tidy(conf.int = TRUE) %>%
kable()
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 3.6060693 | 1.6952580 | 2.127151 | 0.0334076 | 1.0059470 | 8.0421448 |
| RATIO | -0.2195582 | 0.1005191 | -2.184245 | 0.0289443 | -0.4849431 | -0.0664896 |
# warning this is only appropriate for html output
polis.glm %>% sjPlot::tab_model(show.se = TRUE, show.aic = TRUE)
| PA | ||||
|---|---|---|---|---|
| Predictors | Odds Ratios | std. Error | CI | p |
| (Intercept) | 36.82 | 62.42 | 2.73 – 3109.27 | 0.033 |
| RATIO | 0.80 | 0.08 | 0.62 – 0.94 | 0.029 |
| Observations | 19 | |||
| R2 Tjur | 0.521 | |||
| AIC | 18.221 | |||
In simple regression, the \(R^2\) value (coefficient of determination) is interpreted as the amount of variation in the response that can be explained by its relationship with the predictor(s), calculated as the sum of squares explained divided by the sum of squares total. It is considered a measure of the strength of a relationship and is calculated as:
\[ R^2 = 1 - \frac{\sum_{i=1}^N (y_i - \hat{y}_i)^2}{\sum_{i=1}^N (y_i - \bar{y})^2} \] where \(y_i\) and \(\hat{y}_i\) are the \(i^{th}\) observed and predicted values respectively, \(\bar{y}\) is the mean of the observed values and \(N\) is the total number of observations.
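This formula can be confirmed against `summary.lm()` on any ordinary regression; a quick sketch with simulated data:

```r
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
fit <- lm(y ~ x)

# R-squared computed from the definition above
r2 <- 1 - sum(resid(fit)^2) / sum((y - mean(y))^2)
stopifnot(all.equal(r2, summary(fit)$r.squared))
```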
This is really only appropriate for OLS. For other models there are alternative measures (pseudo R-squared) that can be applied depending on how they are to be interpreted:
One very simple calculation is based on deviance (a measure of the total amount unexplained) as:
\[ 1-\frac{Deviance}{Deviance_{NULL}} \]
where \(Deviance_{NULL}\) is the deviance of a null model (e.g. glm(PA ~ 1, data=polis, family='binomial'))
Alternatively, there are many other ways to calculate \(R^2\) values from GLMs. The following table provides some guidance as to which method is appropriate for which type of model and how they should be interpreted.
| Model | Appropriate \(R^2\) | Formula | Interpreted as | Function |
|---|---|---|---|---|
| Logistic | Tjur's R2 | \(\dagger\) | | `performance::r2_tjur()` |
| Multinomial Logit | McFadden's R2 | \(\ddagger\) | 1 & 2 | `performance::r2_mcfadden()` |
| GLM | Nagelkerke's R2 | \(\S\) | 2 | `performance::r2_nagelkerke()` |
| GLM | Likelihood ratio | Adjusted Nagelkerke - see below | | `MuMIn::r.squaredLR()` |
| Mixed models | Nakagawa's R2 | Too complex | | `performance::r2_nakagawa()` |
| Mixed models | | | | `MuMIn::r.squaredGLMM()` |
| ZI models | Zero-inflated R2 | Too complex | | `performance::r2_zeroinflated()` |
| Bayesian models | Bayes R2 | Too complex | | `performance::r2_bayes()` |
\(\dagger\): \(R^2=\frac{1}{n_{1}}\sum \hat{\pi}(y=1) - \frac{1}{n_{0}}\sum \hat{\pi}(y=0)\)
\(\ddagger\): \(R^2=1-\frac{logL(x)}{logL(0)}\)
\(\S\): \(R^2=\frac{1-\left(\frac{L(0)}{L(x)}\right)^{2/N}}{1-L(0)^{2/N}}\)
where \(n_1\) and \(n_0\) are the number of 1's and 0's in the response and \(\hat{\pi}\) is the predicted probability. \(logL(x)\) and \(logL(0)\) are the log-likelihoods of the fitted and null models respectively, \(L(x)\) and \(L(0)\) are the corresponding likelihoods, and \(N\) is the number of observations.
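The McFadden formula (\(\ddagger\)) can be computed by hand from the log-likelihoods of the fitted and null models; a self-contained sketch on simulated data (for a binomial GLM the value falls between 0 and 1):

```r
set.seed(1)
x <- runif(50)
y <- rbinom(50, size = 1, prob = plogis(-1 + 3 * x))
mod <- glm(y ~ x, family = binomial)
mod0 <- glm(y ~ 1, family = binomial)  # null model

r2_mcfadden <- 1 - as.numeric(logLik(mod)) / as.numeric(logLik(mod0))
stopifnot(r2_mcfadden >= 0, r2_mcfadden < 1)
```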
Note, if you run performance::r2(), the function will work out what type of model has been fit and then use the appropriate function from the above table.
# R2
1 - (polis.glm$deviance / polis.glm$null.deviance)
## [1] 0.4590197
Conclusions:
# R2 for binomial outcomes
polis.glm %>% performance::r2_tjur()
## Tjur's R2
## 0.5205851
# R2 for binomial outcomes
polis.glm %>% r2()
## $R2_Tjur
## Tjur's R2
## 0.5205851
## Likelihood ratio based
polis.glm %>% MuMIn::r.squaredLR()
## [1] 0.4700986
## attr(,"adj.r.squared")
## [1] 0.6273785
In some disciplines it is useful to be able to calculate an LD50. This is the value along the x-axis that corresponds to a probability of 50% - i.e. the switch-over point in island perimeter to area ratio at which the lizards go from more likely to be present to more likely to be absent. It is the inflection point.
It is also the point at which the slope (when back-transformed) is at its steepest and can be calculated as:
\[ -\frac{Intercept}{Slope} \]
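By construction, substituting the LD50 back into the linear predictor gives zero on the link scale, and hence a probability of exactly 0.5. A quick check on a simulated fit:

```r
set.seed(1)
x <- runif(50, 0, 40)
y <- rbinom(50, size = 1, prob = plogis(3 - 0.2 * x))
mod <- glm(y ~ x, family = binomial)

ld50 <- -coef(mod)[1] / coef(mod)[2]
p_at_ld50 <- plogis(coef(mod)[1] + coef(mod)[2] * ld50)
stopifnot(all.equal(unname(p_at_ld50), 0.5))
```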
# LD50
-polis.glm$coef[1] / polis.glm$coef[2]
## (Intercept)
## 16.4242
(ld50 <- -polis.glm$coef[1] / polis.glm$coef[2])
## (Intercept)
## 16.4242
## What about other points (not just 50%) along with confidence intervals..
ld <- polis.glm %>% MASS::dose.p(p = c(0.5, 0.9))
ld.SE <- attr(ld, "SE")
lds <- data.frame(
LD = attr(ld, "p"),
Dose = as.vector(ld),
SE = ld.SE
) %>%
mutate(
lower = Dose - SE * qnorm(0.975),
upper = Dose + SE * qnorm(0.975)
)
lds
Useful summary figures to accompany statistical models typically depict the modelled trend(s) overlaid onto the data used to fit the model. When there is only a single predictor, the data should be the raw data. However, when there are numerous predictors and the modelled trend represents a trend associated with one (or more) predictor(s) marginalising over other predictor(s), then displaying the raw data may not be satisfactory. The reason for this is that the modelled partial trend depicts the trend between the response and the focal predictor(s) holding the other predictor(s) constant - that is, standardising for the other predictor(s). Plotting the raw data will not reflect this standardisation.
Hence, rather than plot the raw data, we can instead plot the partial observations (partial residuals). To do so, add the (working) residuals onto predictions associated with the observed predictor values.
Unfortunately, this technique is not as straightforward for logistic regression. After adding the residuals to the predictions and back-transforming to the response scale, it is necessary to transform once more onto the binomial (0/1) scale.
In this case, since we just have a single predictor, we might as well just add the raw data.
## Using emmeans
polis.grid <- with(polis, list(RATIO = seq(min(RATIO), max(RATIO), len = 100)))
polis.grid <- with(polis, list(RATIO = seq_range(RATIO, n = 100)))
# OR
polis.grid <- polis %>% data_grid(RATIO = seq_range(RATIO, n = 100))
newdata <- polis.glm %>%
emmeans(~RATIO, at = polis.grid, type = "response") %>%
as.data.frame()
ggplot(newdata, aes(y = prob, x = RATIO)) +
geom_ribbon(aes(ymin = asymp.LCL, ymax = asymp.UCL), fill = "blue", alpha = 0.2) +
geom_line() +
theme_classic()
We could also include the raw data
ggplot(newdata, aes(y = prob, x = RATIO)) +
geom_ribbon(aes(ymin = asymp.LCL, ymax = asymp.UCL), fill = "blue", alpha = 0.2) +
geom_line() +
geom_point(data = polis, aes(y = PA, x = RATIO)) +
theme_classic()
Perhaps mark the boundaries of the confidence intervals with dashed lines rather than a ribbon.
ggplot(newdata, aes(y = prob, x = RATIO)) +
geom_line(aes(y = asymp.LCL), linetype = "dashed") +
geom_line(aes(y = asymp.UCL), linetype = "dashed") +
geom_line() +
geom_point(data = polis, aes(y = PA, x = RATIO)) +
theme_classic()
polis.partial <- polis.glm %>%
emmeans(~RATIO, at = polis, type = "response") %>%
as.data.frame() %>%
mutate(
Resid = stats::residuals(polis.glm, type = "response"),
Partial.obs = prob + Resid
)
ggplot(newdata, aes(x = RATIO)) +
geom_point(data = polis.partial, aes(y = Partial.obs), color = "black") +
geom_ribbon(aes(ymin = asymp.LCL, ymax = asymp.UCL), fill = "blue", alpha = 0.2) +
geom_line(aes(y = prob)) +
geom_vline(xintercept = ld50, linetype = "dashed") +
theme_classic()
# Partial will represent the fitted values plus the residuals back transformed onto the probability scale
# Partial.obs then backtransforms these onto the response scale [0,1]
polis.partial <- polis.glm %>%
emmeans(~RATIO, at = polis, type = "link") %>%
as.data.frame() %>%
mutate(
Resid = stats::residuals(polis.glm, type = "working"),
Partial = plogis(emmean + Resid),
Partial.obs = qbinom(Partial, 1, 0.5)
)
ggplot(newdata, aes(x = RATIO)) +
geom_point(data = polis.partial, aes(y = Partial), color = "gray") +
geom_point(data = polis.partial, aes(y = Partial.obs), color = "black") +
geom_ribbon(aes(ymin = asymp.LCL, ymax = asymp.UCL), fill = "blue", alpha = 0.2) +
geom_line(aes(y = prob)) +
geom_vline(xintercept = ld50, linetype = "dashed") +
theme_classic()
The modelr package takes a very specific approach to predictions and residuals, using predict() and y - predict() respectively. Hence by adding the residuals and predictions, we will recover the partial observations. Note however, the predictions will be conditional means and the partial observations will not be standardised by other predictors. Furthermore, as this routine uses predict(), confidence intervals are not always available. This routine is shown for comparison only, it is too limiting to be viable for most applications.
newdata <- polis %>%
modelr::add_predictions(polis.glm) %>%
modelr::add_residuals(polis.glm) %>%
mutate(
Partial = pred + resid,
pred = plogis(pred)
)
ggplot(newdata, aes(y = pred, x = RATIO)) +
geom_line() +
# geom_ribbon(aes(ymin=asymp.LCL, ymax=asymp.UCL), fill='blue',alpha=0.2)+
geom_point(data = polis, aes(y = PA, x = RATIO)) +
geom_point(aes(y = Partial), color = "green") +
geom_vline(xintercept = ld50, linetype = "dashed") +
theme_classic()
polis.mod <- polis %>% fit_with(glm, list(PA ~ RATIO), family = binomial())
polis.mod[[1]] %>% modelr::rsquare(polis)
## [1] -46.68859
map(polis.mod, ~ rsquare(., polis))
## [[1]]
## [1] -46.68859
modelr::add_predictions(polis, polis.mod[[1]], type = "response")